-
Notifications
You must be signed in to change notification settings - Fork 321
[code_review] Misc improvements part 4 #5588
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
[code_review] Misc improvements part 4 #5588
Conversation
We can have better tracking with W&B Weave
Introduces a run_by_diff_id method that retrieves a patch by diff ID from review_data and runs the review process.
Eliminated the abstract version property from GenerativeModelTool and removed the version attribute from CodeReviewTool since it is not used.
Moved suggestion filtering logic from code_review/agent.py to a new suggestion_filtering module. Introduced SuggestionFilteringTool for filtering review comments, updated CodeReviewTool to use the new filterer, and relocated related prompt templates. This improves modularity and separation of concerns for suggestion filtering.
Wrapped comments and rejected examples in <comments-to-filter> and <rejected-examples> tags to improve prompt structure and clarity.
It will be replaced with W&B Weave evaluation pipeline
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR refactors the code review evaluation infrastructure by replacing old script-based evaluation tools with a more modular architecture and W&B Weave integration for tracking evaluations.
Changes:
- Removes legacy evaluation scripts (
code_review_tool_evaluator.py,code_review_tool_evaluator_report.py) and experimental files - Introduces new modular tools for patch summarization, suggestion filtering, and comment matching
- Adds Jupyter notebooks for dataset creation and evaluation using W&B Weave
- Refactors
CodeReviewToolto use Protocol-based dependency injection for better testability - Updates platform base classes to accept both
strandintforpatch_idparameters
Reviewed changes
Copilot reviewed 21 out of 22 changed files in this pull request and generated 14 comments.
Show a summary per file
| File | Description |
|---|---|
| scripts/code_review_tool_evaluator_report.py | Removed legacy evaluation report generator |
| scripts/code_review_tool_evaluator.py | Removed legacy evaluation script (613 lines) |
| experiments/review_helper_modify_filtering_step.ipy | Removed experimental filtering modification script |
| requirements.txt | Added weave>=0.50.0 for evaluation tracking |
| notebooks/code_review_evaluation.ipynb | New notebook for running W&B Weave evaluations |
| notebooks/code_review_create_dataset.ipynb | New notebook for creating evaluation datasets |
| bugbug/tools/suggestion_filtering/prompts.py | Extracted filtering prompts to dedicated module |
| bugbug/tools/suggestion_filtering/agent.py | New modular suggestion filtering tool |
| bugbug/tools/patch_summarization/prompts.py | Extracted summarization prompts to dedicated module |
| bugbug/tools/patch_summarization/agent.py | New modular patch summarization tool |
| bugbug/tools/comment_matching/prompts.py | New prompts for LLM-based comment matching |
| bugbug/tools/comment_matching/agent.py | New tool for matching generated vs ground truth comments |
| bugbug/tools/code_review/scorer.py | New Weave scorers for evaluation metrics |
| bugbug/tools/code_review/utils.py | Refactored to work with structured comment objects |
| bugbug/tools/code_review/prompts.py | Removed prompts moved to specialized modules |
| bugbug/tools/code_review/agent.py | Refactored to use Protocol-based dependencies |
| bugbug/tools/base.py | Simplified by removing version property and print method |
| bugbug/tools/core/platforms/base.py | Updated signature to accept str or int for patch_id |
| bugbug/tools/core/platforms/phabricator.py | Updated signature to accept str or int for patch_id |
| bugbug/tools/core/platforms/swarm.py | Updated signature to accept str or int for patch_id |
| bugbug/code_search/searchfox_api.py | Made get_file parameter optional with default implementation |
| bugbug/code_search/mozilla.py | Made get_file parameter optional with fallback |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| system_prompt=SYSTEM_PROMPT, | ||
| response_format=ProviderStrategy(AgentResponse), | ||
| ) | ||
|
|
Copilot
AI
Jan 10, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The create method is missing the @staticmethod decorator. This method should be decorated with @staticmethod to match the pattern used in other similar factory methods in the codebase (e.g., PatchSummarizationTool.create(), SuggestionFilteringTool.create()).
| @staticmethod |
| from pydantic import BaseModel, Field | ||
|
|
||
| from bugbug.tools.base import GenerativeModelTool | ||
| from bugbug.tools.code_review.agent import GeneratedReviewComment |
Copilot
AI
Jan 10, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential circular import issue: this module imports GeneratedReviewComment from bugbug.tools.code_review.agent, while bugbug.tools.code_review.agent imports SuggestionFilteringTool from this module. This circular dependency could cause import errors. Consider moving GeneratedReviewComment to a shared data types module or using TYPE_CHECKING with forward references.
|
|
||
| return result["structured_response"].comments | ||
|
|
||
| def run_by_diff_id(self, diff_id: str) -> list[InlineComment] | None: |
Copilot
AI
Jan 10, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The parameter type for diff_id should be str | int to match the signature of get_patch_by_id which was updated in the base class to accept both str and int. This inconsistency could cause type checking issues.
| def run_by_diff_id(self, diff_id: str) -> list[InlineComment] | None: | |
| def run_by_diff_id(self, diff_id: str | int) -> list[InlineComment] | None: |
| from bugbug.tools.patch_summarization.prompts import PROMPT_TEMPLATE_SUMMARIZATION | ||
|
|
||
|
|
||
| class PatchSummarizationTool(GenerativeModelTool): |
Copilot
AI
Jan 10, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PatchSummarizationTool class is missing proper license header comments. While the file has the MPL license header, the class itself should maintain consistency with other similar tool classes that have docstrings explaining their purpose.
| class PatchSummarizationTool(GenerativeModelTool): | |
| class PatchSummarizationTool(GenerativeModelTool): | |
| """Tool for generating natural-language summaries of code patches. | |
| This tool uses a chat-based large language model to summarize a given | |
| patch set in the context of the target software, producing a concise | |
| explanation suitable for reviewers or other stakeholders. | |
| """ |
|
|
||
| def run( | ||
| self, old_comments: list[MatchingComment], new_comments: list[MatchingComment] | ||
| ) -> list[CommentMatch]: |
Copilot
AI
Jan 10, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The run method lacks a docstring explaining its parameters and return value. This is especially important for public API methods. Similar methods in other tool classes (e.g., SuggestionFilteringTool.run) have comprehensive docstrings.
| ) -> list[CommentMatch]: | |
| ) -> list[CommentMatch]: | |
| """ | |
| Find matching code review comments between an old and a new set. | |
| This method sends the provided comments to the underlying language model | |
| agent, which identifies pairs of comments that correspond to each other | |
| across the two versions. | |
| Parameters | |
| ---------- | |
| old_comments: | |
| A list of :class:`MatchingComment` instances representing comments | |
| from the old set (e.g., an earlier revision of a code review). | |
| new_comments: | |
| A list of :class:`MatchingComment` instances representing comments | |
| from the new set (e.g., a later revision of the same code review). | |
| Returns | |
| ------- | |
| list[CommentMatch] | |
| A list of :class:`CommentMatch` objects, each describing a matched | |
| pair of comments between the old and new sets. If either | |
| ``old_comments`` or ``new_comments`` is empty, an empty list is | |
| returned. | |
| """ |
| @@ -0,0 +1,25 @@ | |||
| PROMPT_TEMPLATE_SUMMARIZATION = """You are an expert reviewer for {target_software}, with experience on source code reviews. | |||
| Please, analyze the code provided and report a summarization about the new changes; for that, focus on the coded added represented by lines that start with "+". | |||
Copilot
AI
Jan 10, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a typo in the prompt text. "the coded added" should be "the code added".
| Please, analyze the code provided and report a summarization about the new changes; for that, focus on the coded added represented by lines that start with "+". | |
| Please, analyze the code provided and report a summarization about the new changes; for that, focus on the code added represented by lines that start with "+". |
| @@ -0,0 +1,62 @@ | |||
| from langchain.agents import create_agent | |||
Copilot
AI
Jan 10, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new package directory is missing an __init__.py file. Python packages require an __init__.py file to be properly importable. Other package directories in the codebase (e.g., bugbug/tools/code_review/) include this file.
| @@ -0,0 +1,48 @@ | |||
| from langchain.agents import create_agent | |||
Copilot
AI
Jan 10, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new package directory is missing an __init__.py file. Python packages require an __init__.py file to be properly importable. Other package directories in the codebase (e.g., bugbug/tools/code_review/) include this file.
| @@ -0,0 +1,156 @@ | |||
| # -*- coding: utf-8 -*- | |||
Copilot
AI
Jan 10, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new package directory is missing an __init__.py file. Python packages require an __init__.py file to be properly importable. Other package directories in the codebase (e.g., bugbug/tools/code_review/) include this file.
| def create(**kwargs): | ||
| if "llm" not in kwargs: | ||
| from bugbug.tools.core.llms import create_openai_llm | ||
|
|
||
| kwargs["llm"] = create_openai_llm() | ||
|
|
||
| return CommentMatchingTool(**kwargs) |
Copilot
AI
Jan 10, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Normal methods should have at least one parameter (the first of which should be 'self').
| def create(**kwargs): | |
| if "llm" not in kwargs: | |
| from bugbug.tools.core.llms import create_openai_llm | |
| kwargs["llm"] = create_openai_llm() | |
| return CommentMatchingTool(**kwargs) | |
| @classmethod | |
| def create(cls, **kwargs): | |
| if "llm" not in kwargs: | |
| from bugbug.tools.core.llms import create_openai_llm | |
| kwargs["llm"] = create_openai_llm() | |
| return cls(**kwargs) |
These improvements could be reviewed commit by commit.